Add RVV (RISC-V Vector Extension) optimized convolution and pooling kernels for the NCHWc blocked format in MLAS #28411

Open
velonica0 wants to merge 2 commits into microsoft:main from velonica0:rvv_pr

Conversation

@velonica0 (Contributor) commented May 8, 2026

Description

New kernel files:

  • riscv64/sconv_depthwise_kernel_rvv.cpp — RVV-optimized 3x3 stride-1 depthwise convolution (NCHW format), replacing the MLAS_FLOAT32X4 generic vectorized version
  • riscv64/sconv_nchwc_kernel_rvv.cpp — 7 NCHWc kernels using vfloat32m4_t (LMUL=4, BlockSize=16; a sketch of their shared accumulation pattern follows this list):
    • Direct NCHW conv (MlasConvNchwFloatKernelRvv)
    • Direct NCHWc conv (MlasConvNchwcFloatKernelRvv)
    • Depthwise NCHWc conv (MlasConvDepthwiseFloatKernelRvv)
    • Pointwise NCHWc conv (MlasConvPointwiseFloatKernelRvv)
    • Max/AvgExcludePad/AvgIncludePad pooling
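
The seven kernels above share one core pattern: broadcast an input scalar and fused-multiply-accumulate it against a 16-wide block of filter weights. Below is a minimal sketch of that pattern, assuming a contiguous [input channel][output-channel block] filter layout; the function name and parameters are illustrative, not taken from the PR:

```cpp
#include <riscv_vector.h>
#include <stddef.h>

// Illustrative sketch only: broadcast-accumulate over a BlockSize=16
// output-channel block, the pattern shared by the NCHWc conv kernels.
// nchwc_accumulate_block and its parameters are assumptions for this
// example, not names from the PR.
void nchwc_accumulate_block(const float* input, const float* filter,
                            float* output, size_t input_channels) {
    const size_t BlockSize = 16;
    size_t vl = __riscv_vsetvl_e32m4(BlockSize);           // e32, LMUL=4
    vfloat32m4_t acc = __riscv_vle32_v_f32m4(output, vl);  // running sums
    for (size_t ic = 0; ic < input_channels; ic++) {
        // 16 contiguous output-channel weights per input channel
        vfloat32m4_t filt = __riscv_vle32_v_f32m4(filter + ic * BlockSize, vl);
        acc = __riscv_vfmacc_vf_f32m4(acc, input[ic], filt, vl);
    }
    __riscv_vse32_v_f32m4(output, acc, vl);
}
```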

Motivation and Context

Following #28261, this PR optimizes more MLAS kernels using the RISC-V Vector (RVV) extension.

Please Note:

  • On the K3 (SpacemiT X60), VLEN=256. With LMUL=4 and e32, the hardware can hold (256/32) * 4 = 32 floats per vector register group — but we only request 16. So we're using half the available vector width.

  • The reason is that BlockSize=16 is baked into the NCHWc data layout across the whole framework (matching ARM64 NEON). Changing it to 32 would require a different NCHWc format and is not a localized change. (A small probe sketch follows this list.)
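
To make the half-width point concrete, here is a small standalone probe (illustrative, not part of the PR) comparing the hardware VLMAX with the vl actually granted for the BlockSize=16 request:

```cpp
#include <riscv_vector.h>
#include <stdio.h>

// Illustrative probe: on the K3 (VLEN=256), __riscv_vsetvlmax_e32m4()
// reports 32, while the BlockSize=16 request is granted vl=16,
// i.e. half the vector register group.
int main(void) {
    size_t vlmax = __riscv_vsetvlmax_e32m4();  // (VLEN / 32) * 4
    size_t vl = __riscv_vsetvl_e32m4(16);      // min(16, VLMAX); 16 whenever VLEN >= 128
    printf("VLMAX = %zu, granted vl = %zu\n", vlmax, vl);
    return 0;
}
```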

Benchmark (SpacemiT K3, VLEN=256, 8-core)

All tests pass with zero numerical error.

| Kernel | Speedup (RVV vs scalar) |
| --- | --- |
| Direct NCHW Conv | 1.27–1.29x |
| Direct NCHWc Conv | 1.93–1.95x |
| Depthwise NCHWc Conv | 10.8–12.5x |
| Pointwise NCHWc Conv | 29.4–30.4x |
| Max Pooling | 12.5–20.0x |
| Avg Pooling (exclude pad) | 4.0–4.3x |
| Avg Pooling (include pad) | 5.5–5.8x |

@velonica0 (Contributor, Author) commented:

Hi @hariharans29
Could you please take a look at this PR when you have a moment? I’d really appreciate your help.

Copilot AI (Contributor) left a comment

Pull request overview

This PR extends MLAS’ riscv64 RVV support by wiring up optimized float32 convolution and NCHWc pooling kernels, enabling the NCHWc blocked-format fast paths (BlockSize=16) on RVV-capable systems and replacing the previous generic depthwise implementation.

Changes:

  • Adds new riscv64 RVV kernel implementations for direct NCHW/NCHWc conv, depthwise/pointwise NCHWc conv, and max/avg pooling in NCHWc format.
  • Wires the new kernels into MLAS_PLATFORM initialization for riscv64 when RVV is available, and enables NCHWc fast-path selection for RISCV64+RVV.
  • Updates build configuration to compile the new RVV sources and swap out the previous depthwise kernel source for RISCV64 builds.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| onnxruntime/core/mlas/lib/snchwc.cpp | Enables RISCV64+RVV to use platform-selected NCHWc conv/pool kernels and block size. |
| onnxruntime/core/mlas/lib/riscv64/sconv_nchwc_kernel_rvv.cpp | New RVV implementations for NCHW/NCHWc conv, depthwise/pointwise NCHWc conv, and NCHWc pooling. |
| onnxruntime/core/mlas/lib/riscv64/sconv_depthwise_kernel_rvv.cpp | New RVV 3x3 stride-1 depthwise CHW kernel implementation for the multiplier-1 path. |
| onnxruntime/core/mlas/lib/platform.cpp | Registers RVV NCHWc conv/pool kernels and sets the NCHWc block size for riscv64 when RVV is present (see the sketch below the table). |
| onnxruntime/core/mlas/lib/mlasi.h | Declares new RVV kernel entry points and adds RISCV64+RVV NCHWc members to MLAS_PLATFORM. |
| onnxruntime/core/mlas/lib/convolve.cpp | Enables the depthwise-direct algorithm path on RISCV64 and updates the stride restriction comment/logic. |
| onnxruntime/core/mlas/inc/mlas.h | Exposes MlasConvAlgorithmDepthwise in the enum for RISCV64 builds. |
| cmake/onnxruntime_mlas.cmake | Adds new RVV kernel sources and adjusts which depthwise source is compiled for RISCV64/RVV. |
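
The registration in platform.cpp is presumably analogous to the existing x64 path. Below is a hedged, self-contained mock of that wiring: MockPlatform, RegisterRvvNchwcKernels, the member names, and the simplified void signatures are all illustrative assumptions; only the Mlas*KernelRvv entry-point names come from this PR.

```cpp
#include <cstddef>

// Hypothetical mock of the platform wiring, assumed to mirror the x64
// registration pattern; struct and member names here are illustrative,
// and the real kernels take parameters (elided to keep the mock short).
void MlasConvNchwFloatKernelRvv();
void MlasConvNchwcFloatKernelRvv();
void MlasConvDepthwiseFloatKernelRvv();
void MlasConvPointwiseFloatKernelRvv();

struct MockPlatform {
    void (*ConvNchwFloatKernel)();
    void (*ConvNchwcFloatKernel)();
    void (*ConvDepthwiseFloatKernel)();
    void (*ConvPointwiseFloatKernel)();
    size_t NchwcBlockSize;
};

void RegisterRvvNchwcKernels(MockPlatform& p) {
    p.ConvNchwFloatKernel = MlasConvNchwFloatKernelRvv;
    p.ConvNchwcFloatKernel = MlasConvNchwcFloatKernelRvv;
    p.ConvDepthwiseFloatKernel = MlasConvDepthwiseFloatKernelRvv;
    p.ConvPointwiseFloatKernel = MlasConvPointwiseFloatKernelRvv;
    p.NchwcBlockSize = 16;  // the BlockSize baked into the NCHWc layout
}
```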


Comment thread: onnxruntime/core/mlas/lib/platform.cpp (Outdated)
Comment thread: onnxruntime/core/mlas/lib/riscv64/sconv_nchwc_kernel_rvv.cpp
@velonica0 (Contributor, Author) commented:

Both comments point to the same concern: "What if $vl < 16$?"

In practice, however, this is not an issue, because the minimum VLEN on current RISC-V V-extension CPUs is 128 bits, so this case does not arise (though I have added handling for it regardless).

The calculation is as follows:

__riscv_vsetvl_e32m4(avl) returns $\min(avl, VLMAX)$, where:

$$VLMAX = (VLEN / SEW) \times LMUL = (VLEN / 32) \times 4$$
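
Concretely: on the K3 (VLEN=256), $VLMAX = (256/32) \times 4 = 32$, so __riscv_vsetvl_e32m4(16) returns 16; on a minimum-VLEN core (VLEN=128), $VLMAX = (128/32) \times 4 = 16$, and the call still returns 16. Either way, the kernels never observe $vl < 16$.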

@hariharans29 (Member) commented:

> Both comments point to the same concern: "What if $vl < 16$?"
>
> In practice, however, this is not an issue, because the minimum VLEN on current RISC-V V-extension CPUs is 128 bits, so this case does not arise (though I have added handling for it regardless).
>
> The calculation is as follows: __riscv_vsetvl_e32m4(avl) returns $\min(avl, VLMAX)$, where
>
> $$VLMAX = (VLEN / SEW) \times LMUL = (VLEN / 32) \times 4$$

Thanks for the clarification. It looks good to me. Let me get one last opinion from Copilot.

@hariharans29 requested a review from Copilot May 11, 2026 17:07
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.

#include "kleidiai/mlasi_kleidiai.h"
#endif

#if defined(MLAS_USE_RVV)
acc = __riscv_vfadd_vv_f32m4(acc, old_output, vl);
}

if (KernelFlags & MLAS_CONV_KERNEL_FLAG_BIAS_ADDITION) {
Comment on lines +463 to +470
```cpp
vfloat32m4_t sum_vec = __riscv_vfmv_v_f_f32m4(0.0f, vl);
uint32_t valid_count[BlockSize];

if (ExcludePad) {
    for (size_t i = 0; i < BlockSize; i++) {
        valid_count[i] = 0;
    }
}
// ... (excerpt elides the accumulation over the pooling window)
float results[BlockSize];
__riscv_vse32_v_f32m4(results, sum_vec, vl);
for (size_t i = 0; i < BlockSize; i++) {
    results[i] /= static_cast<float>(valid_count[i]);
```
Comment on lines +136 to +138
```cpp
const size_t pad_top = Parameters->Padding[0];
const size_t pad_left = Parameters->Padding[1];
const size_t pad_right = Parameters->Padding[3];
```
Comment on lines +177 to +187
```cpp
if (pad_left && ow < out_cols) {
    const ptrdiff_t iw = static_cast<ptrdiff_t>(ow) - static_cast<ptrdiff_t>(pad_left);
    float acc = DepthwiseComputeEdge(
        row0, row1, row2, iw, W,
        w00, w01, w02, w10, w11, w12, w20, w21, w22
    );
    if (accumulate_output) {
        acc += beta * out_row[ow];
    }
    out_row[ow++] = acc;
}

if (pad_right && ow < out_cols) {
```
Comment on lines +200 to +221
```cpp
for (size_t kh = 0; kh < KernelHeight; kh++) {
    for (size_t kw = 0; kw < KernelWidth; kw++) {
        const float* input_base = Input + output_idx * StrideWidthElements +
            kh * DilatedInputWidthElements + kw * DilationWidthElements;

        const float* input_row_start = InputBase + kh * DilatedInputWidthElements;
        const float* input_row_end = input_row_start + InputWidthElements;

        size_t kernel_pos = kh * KernelWidth + kw;

        for (size_t ic = 0; ic < BlockSize; ic++) {
            const float* input_element = input_base + ic;

            float input_value = 0.0f;
            if (input_element >= input_row_start && input_element < input_row_end) {
                input_value = *input_element;
            }

            size_t filter_offset = kernel_pos * BlockSize * BlockSize + ic * BlockSize;
            vfloat32m4_t filt = __riscv_vle32_v_f32m4(&filter[filter_offset], vl);
            acc = __riscv_vfmacc_vf_f32m4(acc, input_value, filt, vl);
        }
```
@hariharans29 (Member) commented:

Copilot generated a lot more comments in this round - can you please take a look?

